@ayeganov
Contributor

This PR refactors the project from a monolithic script into a well-defined, reusable library. The core training, data handling, and tokenizer management logic has been extracted from train.py into decoupled, object-oriented components. The goal is a clean API that other projects can use and extend easily.

The train.py script is now a simple command-line client that demonstrates how to use the new library components.

Key Changes ✨

  • Trainer Class: A new scratchgpt/training/trainer.py module introduces the Trainer class, which now encapsulates all logic for training loops, validation, pre-tokenization, and model checkpointing.
  • DataSource Protocol: A new, flexible scratchgpt/data/datasource.py module defines a protocol for data loading. We've included concrete FileDataSource and FolderDataSource implementations, replacing the old TextProvider classes.
  • Refactored Tokenizer I/O: The get_tokenizer function in scratchgpt/model_io.py has been updated to use a factory pattern. This makes creating a default tokenizer more robust and explicit.
  • Modularized Tests: The single, large test file has been broken down into smaller, focused modules under the tests/ directory (test_tokenizer_io.py, tests/tokenizers/..), improving maintainability.
  • Upgraded CLI: The train.py script now uses a --tokenizer argument to dynamically load any tokenizer from the Hugging Face Hub, making it significantly more versatile.
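
To make the DataSource idea concrete for reviewers, here is a minimal sketch of what a protocol-based data abstraction can look like. It is not the code in scratchgpt/data/datasource.py: the single `__iter__` method, the constructor parameters, the `*.txt` default pattern, and the `count_characters` helper are all assumptions made for illustration. Only the class names DataSource, FileDataSource, and FolderDataSource come from this PR.

```python
from collections.abc import Iterator
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Assumed shape: any object that yields text samples when iterated."""

    def __iter__(self) -> Iterator[str]: ...


class FileDataSource:
    """Illustrative only: yields the contents of one text file as a single sample."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def __iter__(self) -> Iterator[str]:
        yield self._path.read_text(encoding="utf-8")


class FolderDataSource:
    """Illustrative only: yields one sample per matching file, in sorted path order."""

    def __init__(self, folder: Path, pattern: str = "*.txt") -> None:
        self._folder = folder
        self._pattern = pattern  # hypothetical parameter, not from the PR

    def __iter__(self) -> Iterator[str]:
        for path in sorted(self._folder.rglob(self._pattern)):
            if path.is_file():
                yield path.read_text(encoding="utf-8")


def count_characters(source: DataSource) -> int:
    """Hypothetical consumer: depends only on the protocol, not on a concrete class."""
    return sum(len(text) for text in source)
```

Because the protocol is structural, a new source (for example one backed by a database or an HTTP endpoint) would only need to provide the same method, with no inheritance from a base class.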

Highlights for Review 🔍

When reviewing, please pay special attention to:

  1. The Trainer API: This is the new heart of the library. Is its interface clear? Does it correctly encapsulate the training logic?
  2. The DataSource Protocol: This is our core data abstraction. Is it flexible enough for future use cases?
  3. get_tokenizer Factory Pattern: Review the new signature in model_io.py. This is a key design pattern for how we manage object creation.
  4. The New train.py: As the first client of our new library, does it demonstrate a clean and intuitive workflow?
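
For item 3, the following sketch shows one common way a factory parameter can make default-tokenizer creation explicit. The real signature in model_io.py is not reproduced here: the function name `get_tokenizer_sketch`, the `default_factory` parameter, the `tokenizer.json` file name, and the toy `CharTokenizer` class are all hypothetical stand-ins.

```python
import json
from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CharTokenizer:
    """Toy tokenizer used only for this sketch; not part of the library."""

    vocab: str

    def encode(self, text: str) -> list[int]:
        return [self.vocab.index(ch) for ch in text]

    def save(self, path: Path) -> None:
        path.write_text(json.dumps({"vocab": self.vocab}), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "CharTokenizer":
        return cls(vocab=json.loads(path.read_text(encoding="utf-8"))["vocab"])


def get_tokenizer_sketch(
    exp_path: Path,
    default_factory: Callable[[], CharTokenizer],
) -> CharTokenizer:
    """Load a saved tokenizer if one exists, otherwise build one via the factory.

    The factory is called only when nothing has been saved, so the caller
    decides explicitly what "default" means instead of the loader guessing.
    """
    saved = exp_path / "tokenizer.json"  # assumed file name
    if saved.exists():
        return CharTokenizer.load(saved)
    return default_factory()
```

A caller would pass something like `lambda: CharTokenizer(vocab="abc")`. Deferring construction this way means an expensive default (such as a download) is never built when a saved tokenizer is already present.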

@ayeganov ayeganov self-assigned this Sep 11, 2025
@dariocazzani dariocazzani merged commit 03fe4e1 into main Sep 12, 2025
3 checks passed
@dariocazzani dariocazzani deleted the feat/easy_to_use_interface branch September 12, 2025 19:44

